Sentence Complexity in French: a Corpus-Based Approach
Authors
Abstract
Language complexity is a notion widely used in a number of linguistic fields and language applications, and can be described by a number of linguistic features and practical measures. This work proposes a closer, data-oriented look at sentence complexity. Starting from a number of different studies, we selected and implemented 52 linguistic features and measured them on a corpus of varied French texts. Using statistical methods, we identify five underlying dimensions of sentence complexity. In addition to providing a better understanding of the phenomenon, these dimensions have been used in some information retrieval experiments.
Similar resources
Arabic to French Sentence Alignment: Exploration of A Cross-language Information Retrieval Approach
Sentence alignment consists in estimating which sentence or sentences in the source language correspond with which sentence or sentences in a target language. We present in this paper a new approach to aligning sentences from a parallel corpus based on a cross-language information retrieval system. This approach consists in building a database of sentences of the target text and considering eac...
Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination
Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...
Application of GMM-VSM to Spoken Language Identification within a GMM System
GMM is one of the most successful models in the field of automatic language identification. In this paper we have proposed a new model named adapted weight GMM (AW-GMM). This model is similar to GMM but the weights are determined using GMM-VSM LID system based on the power of each component in discriminating one language from the others. Also considering the computational complexity of GMM-VSM,...
A Hybrid Machine Translation System Based on a Monotone Decoder
In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...
Bilingual Sentence Alignment Based on Punctuation Marks
We present a new approach to aligning English and Chinese sentences in parallel corpora based solely on punctuation. Although the length-based approach produces high accuracy rates of sentence alignment for clean parallel corpora written in two Western languages such as French-English and German-English, it does not fare as well for parallel corpora that are noisy or written in two distant lan...